This notebook was written for course STAT 4243 at Columbia University, so I did my best to go into intensive detail and include the theory behind the data cleaning, text mining and natural language processing techniques used here, such as sentiment analysis and topic modeling.
My goal is to identify and illustrate similarities and differences between the authors through text mining. Since this is the spooky text dataset, I pay particular attention to the comparison between fear words and joy words. If you are also curious about it, read on!
Let the fun begin!
library(tidytext) # for text analysis
library(wordcloud) # for text visualization
## Loading required package: RColorBrewer
library(ggridges) # for visualization
## Loading required package: ggplot2
library(ggplot2) # for visualization
library(ggfortify) # for visualization
library(Rmisc) # for data analysis
## Loading required package: lattice
## Loading required package: plyr
library(tibble) # for data wrangling
library(dplyr) # for data manipulation
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr) # for string manipulation
library(tidyr) # for data wrangling
library(scales) # for visualization
library(SnowballC) # for text analysis
library(topicmodels) # for text analysis
library(tidyverse) # for data analysis
## ─ Attaching packages ───────────────── tidyverse 1.2.1 ─
## ✔ readr 1.1.1 ✔ purrr 0.2.4
## ✔ forcats 0.2.0
## ─ Conflicts ─────────────────── tidyverse_conflicts() ─
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
library(treemapify) # for visualization
library(topicmodels) # for topic modeling
I also want to use the acast function from the reshape2 package. However, loading the whole package causes conflicts with functions from my other packages, so I call reshape2::acast directly to prevent them. The reshape2 package is for data wrangling.
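As a small sketch of what the namespace-qualified call looks like (on made-up toy counts, not the spooky data):

```r
# Toy long-format counts (hypothetical, for illustration only)
df <- data.frame(word      = c("night", "night", "fear"),
                 sentiment = c("fear", "joy", "fear"),
                 n         = c(3, 1, 2))

# reshape2::acast casts the long table into a word-by-sentiment matrix,
# without attaching the whole reshape2 package
m <- reshape2::acast(df, word ~ sentiment, value.var = "n", fill = 0)
m["night", "fear"]   # 3
```

The same `word ~ sentiment` pattern appears later when building the comparison word cloud.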
Then I manually define a function ‘multiplot’, which can also be found in existing packages. I use this function to render multiple ggplots in one image.
multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  require(grid)
  # Gather all plots from ... and plotlist into one list
  plots <- c(list(...), plotlist)
  numPlots <- length(plots)
  # If no layout is supplied, fill a grid with the given number of columns
  if (is.null(layout)) {
    layout <- matrix(seq(1, cols * ceiling(numPlots / cols)),
                     ncol = cols, nrow = ceiling(numPlots / cols))
  }
  if (numPlots == 1) {
    print(plots[[1]])
  } else {
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Place each plot into its position in the layout matrix
    for (i in 1:numPlots) {
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
The first part of our journey is data preparation. Data preparation is a necessary, but often tedious, activity that is a critical first step in data analytics projects. In this part, I complete 5 important tasks, including reading the data, getting an overview of the data structure and content, checking for missing values, reformatting features and cleaning the data.
The first thing I’ll do is load in the data from a local file.
spooky <- read.csv("/Users/xiangyu/Documents/GitHub/Spring2018/Project_Starter_Codes/data/spooky.csv", as.is = TRUE)
Then let’s get a quick overview of the data structure and content.
summary(spooky)
## id text author
## Length:19579 Length:19579 Length:19579
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
glimpse(spooky)
## Observations: 19,579
## Variables: 3
## $ id <chr> "id26305", "id17569", "id11008", "id27763", "id12958", ...
## $ text <chr> "This process, however, afforded me no means of ascerta...
## $ author <chr> "EAP", "HPL", "EAP", "MWS", "HPL", "MWS", "EAP", "EAP",...
I find the data set consists of 19,579 rows and 3 columns. Each row contains a single scary sentence, such as: ‘This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.’ There are 3 authors in the whole dataset: ‘HPL’ refers to H.P. Lovecraft, ‘MWS’ to Mary Wollstonecraft Shelley and ‘EAP’ to Edgar Allan Poe.
Another important thing to do is to check for missing values. Understanding missing values is essential to managing data successfully: if they are not handled properly, I may end up drawing inaccurate inferences from the data.
sum(is.na(spooky))
## [1] 0
Great! There are no missing values, which saves me a lot of time.
Then I convert the author column to a factor, which will help me a lot later on.
spooky$author <- as.factor(spooky$author)
Then I split the text column into tokens using the tokenizers package (via ‘unnest_tokens’ from tidytext), which drops punctuation and transforms all words to lower case. In addition, I use anti_join to remove all the stop words, like ‘and’ and ‘or’, because these words carry little information for our analysis. However, to make sure I don’t miss any information conveyed by these stop words, I may analyze them specifically later on.
spooky_wrd <- spooky %>% unnest_tokens(word, text)
spooky_wrd <- anti_join(spooky_wrd, stop_words, by = "word")
Congratulations! We have arrived at the second part of our project: an overview of the data through visualization. In this part, I make some basic visualizations in order to get a general picture of the text’s characteristics, including word clouds, word counts, sentence length and word length.
The word is the smallest unit of the text, so let’s first compare word characteristics among the three authors!
To explore deeper than the author-dependent word clouds, here I want to compare the relative word frequencies between each pair of authors. Since we have 3 authors altogether, we have 3 pairs. A word is included in the analysis if it occurs more than 15 times.
frequency1 <- spooky_wrd %>%
count(author, word) %>%
filter(n > 15) %>%
group_by(author) %>%
mutate(freq = n / sum(n)) %>%
select(-n) %>%
spread(author, freq) %>%
gather(author, freq, HPL:MWS) %>%
filter(!is.na(freq) & !is.na(author) & author == "HPL") # keep only the HPL frequencies, to plot against EAP
f1 <- ggplot(frequency1, aes(freq, EAP, color = abs(EAP - freq))) +
geom_abline(color = "red", lty = 1) +
geom_jitter(alpha = 1) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
facet_wrap(~author, ncol = 5) +
labs(y = "EAP", x = NULL) # This plot compares Edgar's word frequencies with HP's.
frequency2 <- spooky_wrd %>%
count(author, word) %>%
filter(n > 15) %>%
group_by(author) %>%
mutate(freq = n / sum(n)) %>%
select(-n) %>%
spread(author, freq) %>%
gather(author, freq, EAP:HPL) %>%
filter(!is.na(freq) & !is.na(author)) # keep the EAP and HPL frequencies, to plot against MWS
f2 <- ggplot(frequency2, aes(freq, MWS, color = abs(MWS - freq))) +
geom_abline(color = "red", lty = 1) +
geom_jitter(alpha = 1) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
facet_wrap(~author, ncol = 5) +
labs(y = "MWS", x = NULL) # This plot compares Mary's word frequencies with Edgar's and HP's.
multiplot(f1, f2, cols=1)
## Warning: Removed 358 rows containing missing values (geom_point).
## Warning: Removed 358 rows containing missing values (geom_text).
## Warning: Removed 948 rows containing missing values (geom_point).
## Warning: Removed 948 rows containing missing values (geom_text).
Let me first give a brief explanation of the picture above. If a word sits in the top left corner, it is used much more by the author on the left side, because that author’s relative word frequency is much higher. If a word sits in the bottom right corner, it is used much more by the author on the top, for the same reason.
Therefore, I find that Edgar uses ‘idea’ and ‘body’ more often than the others, Mary uses ‘father’, ‘hope’ and ‘love’ more than the others, and HP uses ‘ancient’ and ‘street’ more frequently.
I conclude that Mary is more likely to use joy words than the other two authors. Does that mean Mary’s novels are not as scary as the others’? If so, then maybe somebody like me can give them a try!
Next, I choose one of the most popular term-weighting schemes, TF-IDF (short for term frequency–inverse document frequency), not only because it is popular, but also because it is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The tf-idf value increases proportionally with the number of times a word appears in a document, but is offset by how frequent the word is across the corpus. I use TF-IDF to adjust for the fact that some words appear more frequently in general.
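To make the definition concrete, here is a tiny hand-computed example with two toy documents (nothing to do with the spooky data):

```r
# tf(w, d) = count of w in d / total words in d
# idf(w)   = ln(number of documents / number of documents containing w)
d1 <- c("raven", "night", "night")   # toy document 1
d2 <- c("sea", "night")              # toy document 2

tf  <- sum(d1 == "night") / length(d1)   # 2/3: "night" is 2 of d1's 3 words
idf <- log(2 / 2)                        # "night" occurs in both documents, so idf = 0
tf * idf                                 # 0: a word found everywhere carries no weight
```

As far as I know, this natural-log idf is also what tidytext’s bind_tf_idf computes, which is why words common to all three authors score near zero below.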
First, I pick the top twenty tf-idf scores in all the text.
tf_idf <- spooky_wrd %>% # calculate the tf-idf value per word per author
count(author, word) %>%
bind_tf_idf(word, author, n)
spooky_wrd %>%
count(author, word) %>%
bind_tf_idf(word, author, n) %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
top_n(20, tf_idf) %>%
ggplot(aes(word, tf_idf, fill = author)) +
geom_col() +
labs(x = NULL, y = "TF-IDF values") +
theme(legend.position = "right", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9))
Wow! It is surprising that there are so many names here! I can see ‘Perdita’, ‘Adrian’, ‘Justine’, etc. These names appear more frequently than other words.
As always, I pay the closest attention to scary words and joy words. However, because so many names dominate, I cannot find enough emotional words here. Thus, I decide to do the same TF-IDF analysis for each author separately in order to explore deeper.
tf_idf %>%
arrange(desc(tf_idf)) %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(author) %>%
top_n(12, tf_idf) %>%
ungroup() %>%
ggplot(aes(word, tf_idf, fill = author)) +
geom_col() +
labs(x = NULL, y = "TF-IDF") +
theme(legend.position = "none") +
facet_wrap(~ author, ncol = 1, scales = "free")
Let’s look for the emotional words here! I can see ‘perfect’ from Edgar, ‘reef’ from HP, and ‘feelings’ and ‘misery’ from Mary. Relatively speaking, Mary tends to use more emotional words. However, there are still not many emotional words here, and I am also unsure how these words are used. So, in the next part, I turn to bigrams and trigrams, and maybe sentiment analysis later…
In this part, I use bigrams and trigrams to get at the meaning of sentences by breaking them into small pieces. Bigrams and trigrams also help me filter out most of the name words.
A bigram looks at words two at a time: each pair of adjacent words forms a bigram, such as “I read”, “read a”, “a book”, “book about”, “about the”, “the history”, “history of”, “of America”.
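The pairing itself can be sketched in base R with a shorter toy sentence (the real analysis below uses unnest_tokens instead):

```r
words <- c("I", "read", "a", "book")
# pair each word with its right neighbour: drop the last word on the left,
# the first word on the right
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams   # "I read" "read a" "a book"
```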
tfidf_bigram <- spooky_wrd %>%
select(author, word) %>%
unnest_tokens(bigram, word, token = "ngrams", n = 2) %>%
separate(bigram, c("w1", "w2"), sep = " ") %>%
filter(!w1 %in% stop_words$word) %>%
filter(!w2 %in% stop_words$word) %>%
unite(bigram, w1, w2, sep = " ") %>%
count(author, bigram) %>%
bind_tf_idf(bigram, author, n) %>%
arrange(desc(tf_idf))
tfidf_bigram %>%
arrange(desc(tf_idf)) %>%
mutate(bigram = factor(bigram, levels = rev(unique(bigram)))) %>%
group_by(author) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
ggplot(aes(bigram, tf_idf, fill = author)) +
geom_col() +
labs(x = NULL, y = "TF-IDF") +
facet_wrap(~ author, ncol = 1, scales = "free")+
theme(legend.position = "none")
Now all the name words are filtered out. That’s great! I find that Edgar and HP use more joy words than Mary, because they use ‘ha ha’ and ‘heh heh’ more frequently. As for Mary, it seems that she really likes ‘lord raymond’!
A trigram looks at words three at a time: each triple of adjacent words forms a trigram, such as “I read a”, “read a book”, “a book about”, “book about the”, “about the history”, “the history of”, “history of America”.
tfidf_trigram <- spooky %>%
select(author, text) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
separate(trigram, c("w1", "w2", "w3"), sep = " ") %>%
filter(!w1 %in% stop_words$word) %>%
filter(!w2 %in% stop_words$word) %>%
filter(!w3 %in% stop_words$word) %>%
unite(trigram, w1, w2, w3, sep = " ") %>%
count(author, trigram) %>%
bind_tf_idf(trigram, author, n) %>%
arrange(desc(tf_idf))
tfidf_trigram %>%
arrange(desc(tf_idf)) %>%
mutate(trigram = factor(trigram, levels = rev(unique(trigram)))) %>%
group_by(author) %>%
top_n(5, tf_idf) %>%
ungroup() %>%
ggplot(aes(trigram, tf_idf, fill = author)) +
geom_col() +
labs(x = NULL, y = "TF-IDF") +
theme(legend.position = "none") +
facet_wrap(~ author, ncol = 1, scales = "free")
I find that Edgar tends to use more reduplicated words, such as ‘mile mile mile’, ‘ugh ugh ugh’ and ‘ha ha ha’; it is a distinctive feature of his writing style. Mary focuses more on the mental activities of her characters, because she talks more about eyes and hearts. Maybe it is because Mary is the only female author here that her work has more detailed descriptions than the others’.
What is sentiment analysis? Sentiment analysis is the process of determining whether a piece of writing is positive, negative or neutral. It is also known as opinion mining: deriving the opinion or attitude of a speaker. I use sentiment analysis to explore deeper into the fear words and joy words of each author.
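At its core, lexicon-based sentiment analysis is just a join between the tokens and a sentiment dictionary. Here is a minimal base-R sketch with a tiny hand-made lexicon (the real analysis uses the full NRC lexicon via get_sentiments):

```r
# Hypothetical three-word lexicon, for illustration only
lexicon <- data.frame(word      = c("fear", "love", "death"),
                      sentiment = c("fear", "joy", "fear"),
                      stringsAsFactors = FALSE)
tokens  <- data.frame(word = c("death", "love", "night"),
                      stringsAsFactors = FALSE)

# merge() keeps only the matching rows, like inner_join: "night" is not in
# the lexicon, so it silently drops out
merge(tokens, lexicon, by = "word")
```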
A word cloud is a graphical representation of word frequency. Instead of making a simple word cloud, I want to make a comparison cloud, that is, one that compares all these different emotional words in a single word cloud.
spooky_wrd %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
count(word, sentiment, sort = TRUE) %>%
reshape2::acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("black", "red"), max.words = 100)
## Warning in comparison.cloud(., colors = c("black", "red"), max.words = 100):
## the words finally, happiness, excited, lovely, possessed, feeling,
## curiosity, plague, horrible, ill, green, smile, mystery, boy, save, chance,
## remarkable, mother, affection, creature, impossible, scarcely, nose,
## delight, fancy, deserted, guess, grief and companion could not be fit on
## page and will not be plotted.
From this picture, we can see that although we are analyzing spooky text, fear words are only a small fraction of the whole. So I decide to pick some strong fear words and joy words at random and compare their frequencies among the three authors.
I choose these fear words and joy words purely at random (not influenced by the outcome above) and want to see the frequency of words within the same category.
#choose five fear words randomly
fear_wrd <- c("death", "fear", "misery", "despair", "plague")
fear <- spooky_wrd %>%
filter(word %in% fear_wrd) %>%
ggplot(aes(word, fill = author)) +
geom_bar(position = "dodge") +
scale_y_log10() +
labs(x = "A few strong fear words")
#choose five joy words randomly
joy_wrd <- c("true", "friend", "love", "beauty", "sun")
joy <- spooky_wrd %>%
filter(word %in% joy_wrd) %>%
ggplot(aes(word, fill = author)) +
geom_bar(position = "dodge") +
scale_y_log10() +
labs(x = "A few strong joy words")
multiplot(fear, joy, cols=1)
For the fear words, I find that Edgar uses ‘death’, ‘despair’ and ‘fear’ more frequently, while HP rarely uses the word ‘misery’. The frequencies of these fear words in Mary’s text are all very similar.
For the joy words, I find that HP rarely uses the word ‘love’, whereas Mary uses it very often. The frequencies of these joy words are very similar for Mary and Edgar.
Here I want to get the fraction of fear words relative to joy words per sentence. First, I count the fear words in each sentence and compute their fraction among the fear terms plus joy terms, i.e. # fear / (# fear + # joy). Then I plot the distribution of this fraction for the three authors:
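As a quick worked example of the fraction, with hypothetical counts:

```r
fear <- 3; joy <- 1      # say a sentence contains 3 fear words and 1 joy word
fear / (fear + joy)      # 0.75
0 / (0 + 0)              # NaN: a sentence with neither kind of word has no
                         # defined fraction, which is why ggplot warns below
                         # about removed non-finite values
```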
spooky_wrd %>%
inner_join(get_sentiments("nrc"), by = "word") %>%
group_by(author, id, sentiment) %>%
count() %>%
spread(sentiment, n, fill = 0) %>%
group_by(author, id) %>%
summarise(fear = sum(fear),
joy = sum(joy)) %>%
arrange(id) %>%
mutate(frac_fear = fear/(fear + joy)) %>%
ggplot(aes(frac_fear, fill = author)) +
geom_density(bw = .2, alpha = 0.3) +
theme(legend.position = "right") +
labs(x = "Fraction of Fear Words by Joy Words per sentence")
## Warning: Removed 5210 rows containing non-finite values (stat_density).
The distribution of the fraction of fear words per sentence is clearly skewed towards larger values for HP (green) than in the case of MWS and EAP. The difference between Shelley and Poe is more subtle.
A ‘topic’ consists of a cluster of words that frequently occur together. In this part, I use topic modeling to connect words with similar meanings and distinguish between uses of words with multiple meanings.
# Counts how many times each word appears in each sentence
sent_wrd_freqs <- count(spooky_wrd, id, word)
# Creates a document-term matrix (DTM)
spooky_wrd_tm <- cast_dtm(sent_wrd_freqs, id, word, n)
spooky_wrd_lda <- LDA(spooky_wrd_tm, k = 16, control = list(seed = 1234))
spooky_wrd_topics <- tidy(spooky_wrd_lda, matrix = "beta")
spooky_wrd_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta) %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 4) +
coord_flip()
I grab the top 5 words for each of the 16 topics. I see that, for instance, topic 1 contains words such as “love” and “mind”, while topic 4 prominently features “death” and topic 5 has the word “beauty”. We also see several overlaps between topics, such as “found” or “time”.
I am also curious about how these topics relate to our three authors, so I build a treemap here.
t1_tdocs <- tidy(spooky_wrd_lda, matrix = "gamma")
t1_tdocs %>%
left_join(spooky_wrd, by = c("document" = "id")) %>%
select(-word) %>%
mutate(topic = as.factor(topic)) %>%
group_by(document) %>%
top_n(1, gamma) %>%
ungroup() %>%
group_by(author, topic) %>%
count() %>%
ungroup() %>%
ggplot(aes(area = n, fill = topic, label = topic, subgroup = author)) +
geom_treemap() +
geom_treemap_subgroup_border() +
geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.6, colour =
"blue", fontface = "italic", min.size = 0) +
geom_treemap_text(colour = "pink", place = "topleft", reflow = T) +
theme(legend.position = "none") +
ggtitle("LDA topics per author")
Each author’s sentences are spread across essentially all 16 topics. I find that topic 1 and topic 5 are more related to Mary’s novels, while topic 8 and topic 15 are more related to Edgar’s novels.
In the end, I want to find something interesting about the stop words. I choose 6 stop words at random and compare their frequencies among the three authors.
# I choose 'why', 'what', 'who', 'when', 'where' and 'more'
my_stop <- c("why", "what", "who", "when", "where", "more")
spooky %>%
unnest_tokens(word, text) %>%
filter(word %in% my_stop) %>%
ggplot(aes(word, fill = author)) +
geom_bar(position = "dodge") +
scale_y_log10() +
labs(x = "A few example stop words")
I cannot find many differences in how the three authors use these stop words. Thus, it is reasonable to filter them out at the start, as I did.